Hotel Reviews Sentiment Analysis
The data was scraped from Booking.com. It contains reviews of hotels present at multiple geographical locations.
This dataset contains 515,000 customer reviews and scores for 1493 luxury hotels across Europe. The geographical location of each hotel is also provided for further analysis. The CSV file contains 17 fields, described below:
Hotel_Address: Address of hotel.
Review_Date: Date when reviewer posted the corresponding review.
Average_Score: Average Score of the hotel, calculated based on the latest comment in the last year.
Hotel_Name: Name of Hotel
Reviewer_Nationality: Nationality of Reviewer
Negative_Review: Negative review the reviewer gave to the hotel. If the reviewer gave no negative review, the value is 'No Negative'.
Review_Total_Negative_Word_Counts: Total number of words in the negative review.
Positive_Review: Positive review the reviewer gave to the hotel. If the reviewer gave no positive review, the value is 'No Positive'.
Review_Total_Positive_Word_Counts: Total number of words in the positive review.
Reviewer_Score: Score the reviewer has given to the hotel, based on his/her experience
Total_Number_of_Reviews_Reviewer_Has_Given: Number of reviews the reviewer has given in the past.
Total_Number_of_Reviews: Total number of valid reviews the hotel has.
Tags: Tags reviewer gave the hotel.
days_since_review: Duration between the review date and scrape date.
Additional_Number_of_Scoring: Some guests only gave a score for the service without writing a review. This number indicates how many valid scores without a review the hotel has.
lat: Latitude of the hotel
lng: Longitude of the hotel
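The days_since_review field is stored as text rather than as a number (values such as '3 days' and '704 day' appear in the data). A minimal sketch of converting it to an integer, assuming every value begins with the day count; the helper name `parse_days` is hypothetical and not part of the original notebook:

```python
import re

def parse_days(value):
    """Return the leading integer of a string such as '704 day' (hypothetical helper)."""
    match = re.match(r"\s*(\d+)", value)
    if match is None:
        raise ValueError(f"no day count found in {value!r}")
    return int(match.group(1))

# Sample values copied from the table outputs in this notebook
examples = ["0 days", "3 days", "704 day"]
parsed = [parse_days(v) for v in examples]
print(parsed)
```

On the full DataFrame the same helper could be applied with `data['days_since_review'].apply(parse_days)`.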
Objective: perform sentiment analysis on the reviews.
Name: Vidit Kumar Pal, Email: vidit.20.pal@gmail.com, Contact: +91-7985431988
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import emoji
import string
import nltk
from PIL import Image
from collections import Counter
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC,LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
import pickle
data=pd.read_csv('Hotel_Reviews.csv')
data.head()
| | Hotel_Address | Additional_Number_of_Scoring | Review_Date | Average_Score | Hotel_Name | Reviewer_Nationality | Negative_Review | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Positive_Review | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | Tags | days_since_review | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Russia | I am so angry that i made this post available... | 397 | 1403 | Only the park outside of the hotel was beauti... | 11 | 7 | 2.9 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968 |
| 1 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Ireland | No Negative | 0 | 1403 | No real complaints the hotel was great great ... | 105 | 7 | 7.5 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968 |
| 2 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/31/2017 | 7.7 | Hotel Arena | Australia | Rooms are nice but for elderly a bit difficul... | 42 | 1403 | Location was good and staff were ok It is cut... | 21 | 9 | 7.1 | [' Leisure trip ', ' Family with young childre... | 3 days | 52.360576 | 4.915968 |
| 3 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/31/2017 | 7.7 | Hotel Arena | United Kingdom | My room was dirty and I was afraid to walk ba... | 210 | 1403 | Great location in nice surroundings the bar a... | 26 | 1 | 3.8 | [' Leisure trip ', ' Solo traveler ', ' Duplex... | 3 days | 52.360576 | 4.915968 |
| 4 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/24/2017 | 7.7 | Hotel Arena | New Zealand | You When I booked with your company on line y... | 140 | 1403 | Amazing location and building Romantic setting | 8 | 3 | 6.7 | [' Leisure trip ', ' Couple ', ' Suite ', ' St... | 10 days | 52.360576 | 4.915968 |
data.tail()
| | Hotel_Address | Additional_Number_of_Scoring | Review_Date | Average_Score | Hotel_Name | Reviewer_Nationality | Negative_Review | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Positive_Review | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | Tags | days_since_review | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 515733 | Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... | 168 | 8/30/2015 | 8.1 | Atlantis Hotel Vienna | Kuwait | no trolly or staff to help you take the lugga... | 14 | 2823 | location | 2 | 8 | 7.0 | [' Leisure trip ', ' Family with older childre... | 704 day | 48.203745 | 16.335677 |
| 515734 | Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... | 168 | 8/22/2015 | 8.1 | Atlantis Hotel Vienna | Estonia | The hotel looks like 3 but surely not 4 | 11 | 2823 | Breakfast was ok and we got earlier check in | 11 | 12 | 5.8 | [' Leisure trip ', ' Family with young childre... | 712 day | 48.203745 | 16.335677 |
| 515735 | Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... | 168 | 8/19/2015 | 8.1 | Atlantis Hotel Vienna | Egypt | The ac was useless It was a hot week in vienn... | 19 | 2823 | No Positive | 0 | 3 | 2.5 | [' Leisure trip ', ' Family with older childre... | 715 day | 48.203745 | 16.335677 |
| 515736 | Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... | 168 | 8/17/2015 | 8.1 | Atlantis Hotel Vienna | Mexico | No Negative | 0 | 2823 | The rooms are enormous and really comfortable... | 25 | 3 | 8.8 | [' Leisure trip ', ' Group ', ' Standard Tripl... | 717 day | 48.203745 | 16.335677 |
| 515737 | Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... | 168 | 8/9/2015 | 8.1 | Atlantis Hotel Vienna | Hungary | I was in 3rd floor It didn t work Free Wife | 13 | 2823 | staff was very kind | 6 | 1 | 8.3 | [' Leisure trip ', ' Family with young childre... | 725 day | 48.203745 | 16.335677 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515738 entries, 0 to 515737
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   Hotel_Address                               515738 non-null  object
 1   Additional_Number_of_Scoring                515738 non-null  int64
 2   Review_Date                                 515738 non-null  object
 3   Average_Score                               515738 non-null  float64
 4   Hotel_Name                                  515738 non-null  object
 5   Reviewer_Nationality                        515738 non-null  object
 6   Negative_Review                             515738 non-null  object
 7   Review_Total_Negative_Word_Counts           515738 non-null  int64
 8   Total_Number_of_Reviews                     515738 non-null  int64
 9   Positive_Review                             515738 non-null  object
 10  Review_Total_Positive_Word_Counts           515738 non-null  int64
 11  Total_Number_of_Reviews_Reviewer_Has_Given  515738 non-null  int64
 12  Reviewer_Score                              515738 non-null  float64
 13  Tags                                        515738 non-null  object
 14  days_since_review                           515738 non-null  object
 15  lat                                         512470 non-null  float64
 16  lng                                         512470 non-null  float64
dtypes: float64(4), int64(5), object(8)
memory usage: 66.9+ MB
data.isnull().sum()
Hotel_Address                                    0
Additional_Number_of_Scoring                     0
Review_Date                                      0
Average_Score                                    0
Hotel_Name                                       0
Reviewer_Nationality                             0
Negative_Review                                  0
Review_Total_Negative_Word_Counts                0
Total_Number_of_Reviews                          0
Positive_Review                                  0
Review_Total_Positive_Word_Counts                0
Total_Number_of_Reviews_Reviewer_Has_Given       0
Reviewer_Score                                   0
Tags                                             0
days_since_review                                0
lat                                           3268
lng                                           3268
dtype: int64
data.dropna(inplace=True,axis=0)
data.isnull().sum()
Hotel_Address                                 0
Additional_Number_of_Scoring                  0
Review_Date                                   0
Average_Score                                 0
Hotel_Name                                    0
Reviewer_Nationality                          0
Negative_Review                               0
Review_Total_Negative_Word_Counts             0
Total_Number_of_Reviews                       0
Positive_Review                               0
Review_Total_Positive_Word_Counts             0
Total_Number_of_Reviews_Reviewer_Has_Given    0
Reviewer_Score                                0
Tags                                          0
days_since_review                             0
lat                                           0
lng                                           0
dtype: int64
data['Negative_Review'].value_counts()
No Negative 127035
Nothing 14227
Nothing 4212
nothing 2211
N A 1032
...
Room wasn t ready rooms freezing hotel basic and outdated 1
not so close to underground 1
There was a terrible smell when you switched on the light in the bathroom 1
A bit far with underground walk more than 5 minutes 1
I was in 3rd floor It didn t work Free Wife 1
Name: Negative_Review, Length: 327927, dtype: int64
data.describe()
| | Additional_Number_of_Scoring | Average_Score | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | lat | lng |
|---|---|---|---|---|---|---|---|---|---|
| count | 512470.000000 | 512470.000000 | 512470.000000 | 512470.000000 | 512470.000000 | 512470.000000 | 512470.000000 | 512470.000000 | 512470.000000 |
| mean | 500.118391 | 8.397934 | 18.541864 | 2747.504902 | 17.765052 | 7.152272 | 8.395594 | 49.442439 | 2.823803 |
| std | 501.419262 | 0.549133 | 29.693695 | 2322.698454 | 21.789025 | 11.028943 | 1.638170 | 3.466325 | 4.579425 |
| min | 1.000000 | 5.200000 | 0.000000 | 43.000000 | 0.000000 | 1.000000 | 2.500000 | 41.328376 | -0.369758 |
| 25% | 169.000000 | 8.100000 | 2.000000 | 1161.000000 | 5.000000 | 1.000000 | 7.500000 | 48.214662 | -0.143372 |
| 50% | 343.000000 | 8.400000 | 9.000000 | 2134.000000 | 11.000000 | 3.000000 | 8.800000 | 51.499981 | 0.010607 |
| 75% | 666.000000 | 8.800000 | 23.000000 | 3633.000000 | 22.000000 | 8.000000 | 9.600000 | 51.516288 | 4.834443 |
| max | 2682.000000 | 9.800000 | 408.000000 | 16670.000000 | 395.000000 | 355.000000 | 10.000000 | 52.400181 | 16.429233 |
data.describe(include='object').T
| | count | unique | top | freq |
|---|---|---|---|---|
| Hotel_Address | 512470 | 1476 | 163 Marsh Wall Docklands Tower Hamlets London ... | 4789 |
| Review_Date | 512470 | 731 | 8/2/2017 | 2584 |
| Hotel_Name | 512470 | 1475 | Britannia International Hotel Canary Wharf | 4789 |
| Reviewer_Nationality | 512470 | 227 | United Kingdom | 244457 |
| Negative_Review | 512470 | 327927 | No Negative | 127035 |
| Positive_Review | 512470 | 409941 | No Positive | 35737 |
| Tags | 512470 | 54934 | [' Leisure trip ', ' Couple ', ' Double Room '... | 5100 |
| days_since_review | 512470 | 731 | 1 days | 2584 |
data["Hotel_Address"].head(10)
0    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
1    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
2    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
3    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
4    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
5    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
6    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
7    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
8    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
9    s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
Name: Hotel_Address, dtype: object
print("Duplicated rows before: ",data.duplicated().sum())
data.drop_duplicates(inplace=True)
print("Duplicated rows after: ",data.duplicated().sum())
Duplicated rows before: 526
Duplicated rows after: 0
data["Hotel_Address"]=data["Hotel_Address"].str.replace("United Kingdom","UK")
data.head()
| | Hotel_Address | Additional_Number_of_Scoring | Review_Date | Average_Score | Hotel_Name | Reviewer_Nationality | Negative_Review | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Positive_Review | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | Tags | days_since_review | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Russia | I am so angry that i made this post available... | 397 | 1403 | Only the park outside of the hotel was beauti... | 11 | 7 | 2.9 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968 |
| 1 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Ireland | No Negative | 0 | 1403 | No real complaints the hotel was great great ... | 105 | 7 | 7.5 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968 |
| 2 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/31/2017 | 7.7 | Hotel Arena | Australia | Rooms are nice but for elderly a bit difficul... | 42 | 1403 | Location was good and staff were ok It is cut... | 21 | 9 | 7.1 | [' Leisure trip ', ' Family with young childre... | 3 days | 52.360576 | 4.915968 |
| 3 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/31/2017 | 7.7 | Hotel Arena | United Kingdom | My room was dirty and I was afraid to walk ba... | 210 | 1403 | Great location in nice surroundings the bar a... | 26 | 1 | 3.8 | [' Leisure trip ', ' Solo traveler ', ' Duplex... | 3 days | 52.360576 | 4.915968 |
| 4 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/24/2017 | 7.7 | Hotel Arena | New Zealand | You When I booked with your company on line y... | 140 | 1403 | Amazing location and building Romantic setting | 8 | 3 | 6.7 | [' Leisure trip ', ' Couple ', ' Suite ', ' St... | 10 days | 52.360576 | 4.915968 |
data[data['Average_Score']==9.8].Hotel_Name
54717    Ritz Paris
54718    Ritz Paris
54719    Ritz Paris
54720    Ritz Paris
54721    Ritz Paris
54722    Ritz Paris
54723    Ritz Paris
54724    Ritz Paris
54725    Ritz Paris
54726    Ritz Paris
54727    Ritz Paris
54728    Ritz Paris
54729    Ritz Paris
54730    Ritz Paris
54731    Ritz Paris
54732    Ritz Paris
54733    Ritz Paris
54734    Ritz Paris
54735    Ritz Paris
54736    Ritz Paris
54737    Ritz Paris
54738    Ritz Paris
54739    Ritz Paris
54740    Ritz Paris
54741    Ritz Paris
54742    Ritz Paris
54743    Ritz Paris
54744    Ritz Paris
Name: Hotel_Name, dtype: object
plt.figure(figsize=(15,10))
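# Restrict to numeric columns: recent pandas versions raise an error when corr() meets text columns
sns.heatmap(data=data.select_dtypes(include='number').corr(),annot=True)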
<AxesSubplot:>
# Nationalities of the first 200 reviewers who gave a score of 10
top_scorers = data.loc[data['Reviewer_Score']==10, 'Reviewer_Nationality'].head(200)
sns.countplot(x=top_scorers)
plt.xticks(rotation=90)
[Figure: count plot of Reviewer_Nationality for the first 200 reviews with Reviewer_Score 10; the x-axis shows 35 nationalities, beginning with United Kingdom, Italy, and Netherlands.]
# Nationalities of the first 200 reviewers who gave the minimum score of 2.5
lowest_scorers = data.loc[data['Reviewer_Score']==2.5, 'Reviewer_Nationality'].head(200)
sns.countplot(x=lowest_scorers)
plt.xticks(rotation=90)
[Figure: count plot of Reviewer_Nationality for the first 200 reviews with Reviewer_Score 2.5; the x-axis shows 38 nationalities (one of them blank), beginning with United Kingdom, Saudi Arabia, and France.]
# Dates are stored month-first (e.g. 7/31/2017), so state the format explicitly
data['Review_Date']=pd.to_datetime(data['Review_Date'], format='%m/%d/%Y')
data['years']=data['Review_Date'].dt.year
data['months']=data['Review_Date'].dt.month
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 511944 entries, 0 to 515737
Data columns (total 19 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   Hotel_Address                               511944 non-null  object
 1   Additional_Number_of_Scoring                511944 non-null  int64
 2   Review_Date                                 511944 non-null  datetime64[ns]
 3   Average_Score                               511944 non-null  float64
 4   Hotel_Name                                  511944 non-null  object
 5   Reviewer_Nationality                        511944 non-null  object
 6   Negative_Review                             511944 non-null  object
 7   Review_Total_Negative_Word_Counts           511944 non-null  int64
 8   Total_Number_of_Reviews                     511944 non-null  int64
 9   Positive_Review                             511944 non-null  object
 10  Review_Total_Positive_Word_Counts           511944 non-null  int64
 11  Total_Number_of_Reviews_Reviewer_Has_Given  511944 non-null  int64
 12  Reviewer_Score                              511944 non-null  float64
 13  Tags                                        511944 non-null  object
 14  days_since_review                           511944 non-null  object
 15  lat                                         511944 non-null  float64
 16  lng                                         511944 non-null  float64
 17  years                                       511944 non-null  int64
 18  months                                      511944 non-null  int64
dtypes: datetime64[ns](1), float64(4), int64(7), object(7)
memory usage: 78.1+ MB
sns.pointplot(data=data,x=data['years'],y=data['Total_Number_of_Reviews'])
plt.xticks(rotation=90)
[Figure: point plot of Total_Number_of_Reviews by year; x-axis: 2015, 2016, 2017.]
sns.lineplot(data=data,x=data['months'],y=data['Total_Number_of_Reviews'])
<AxesSubplot:xlabel='months', ylabel='Total_Number_of_Reviews'>
sns.lineplot(data=data,x=data['months'],y=data['Review_Total_Negative_Word_Counts'])
<AxesSubplot:xlabel='months', ylabel='Review_Total_Negative_Word_Counts'>
sns.lineplot(data=data,x=data['months'],y=data['Review_Total_Positive_Word_Counts'])
<AxesSubplot:xlabel='months', ylabel='Review_Total_Positive_Word_Counts'>
sns.lineplot(data=data,x=data['months'],y=data['Average_Score'])
<AxesSubplot:xlabel='months', ylabel='Average_Score'>
sns.lineplot(data=data,x=data['months'],y=data['Reviewer_Score'])
<AxesSubplot:xlabel='months', ylabel='Reviewer_Score'>
sns.countplot(data=data,x=data['Average_Score'])
plt.xticks(rotation=90)
[Figure: count plot of Average_Score; the x-axis shows 34 distinct values from 5.2 to 9.8.]
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
data.Reviewer_Nationality.nunique()
227
# Get the top 10 reviewer nationalities with the most reviews
nationality = data["Reviewer_Nationality"].value_counts(dropna=False)[:10]
# Create a bar chart of the nationalities and review counts
fig = px.bar(x=nationality.index, y=nationality.values, color=nationality.index,
title="Top 10 Nationalities of Reviewers")
fig.update_layout(xaxis_title="Nationality", yaxis_title="Review Count", font=dict(size=14))
fig.show()
data["Hotel_Name"].nunique()
1475
# Get the top 10 hotels with the most reviews
names = data["Hotel_Name"].value_counts(dropna=False)[:10]
fig = px.bar(x=names.index, y=names.values, color=names.index,
title="Top 10 Hotels with the Most Reviews")
fig.update_layout(xaxis_title="Hotel Name", yaxis_title="Review Count", font=dict(size=14))
fig.show()
fig = px.histogram(data, x="Reviewer_Score", title='Review Score Distribution', nbins=20, text_auto=True)
fig.show()
fig = px.histogram(data, x="Average_Score", title='Review Average Score Distribution')
fig.show()
data['Negative_Review'][1]
'No Negative'
data.loc[:, 'Positive_Review'] = data.Positive_Review.apply(lambda x: x.replace('No Positive', ''))
data.loc[:, 'Negative_Review'] = data.Negative_Review.apply(lambda x: x.replace('No Negative', ''))
data['Negative_Review'][1]
''
data["Total_Review"] = data["Negative_Review"] + data["Positive_Review"]
data["review_type"] = data["Reviewer_Score"].apply(
lambda x: "Bad_review" if x < 7 else "Good_review")
df_reviews = data[["Total_Review", "review_type"]]
df_reviews
| | Total_Review | review_type |
|---|---|---|
| 0 | I am so angry that i made this post available... | Bad_review |
| 1 | No real complaints the hotel was great great ... | Good_review |
| 2 | Rooms are nice but for elderly a bit difficul... | Good_review |
| 3 | My room was dirty and I was afraid to walk ba... | Bad_review |
| 4 | You When I booked with your company on line y... | Bad_review |
| ... | ... | ... |
| 515733 | no trolly or staff to help you take the lugga... | Good_review |
| 515734 | The hotel looks like 3 but surely not 4 Brea... | Bad_review |
| 515735 | The ac was useless It was a hot week in vienn... | Bad_review |
| 515736 | The rooms are enormous and really comfortable... | Good_review |
| 515737 | I was in 3rd floor It didn t work Free Wife ... | Good_review |
511944 rows × 2 columns
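The labelling rule used above can be checked on a few scores in isolation. This is a minimal sketch, assuming the same cut-off of 7 as the notebook; the function name `label_review` is hypothetical and the sample scores are those of the first five rows shown earlier:

```python
def label_review(score, threshold=7.0):
    """Label a score below the threshold as 'Bad_review', otherwise 'Good_review'."""
    return "Bad_review" if score < threshold else "Good_review"

# Reviewer_Score values of the first five rows of the dataset
sample_scores = [2.9, 7.5, 7.1, 3.8, 6.7]
labels = [label_review(s) for s in sample_scores]
print(labels)
```

A score of exactly 7 falls on the 'Good_review' side because the comparison is strict.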
fig = px.histogram(df_reviews, x="review_type", title='Review Type Distribution', text_auto=True)
fig.show()
df_reviews[df_reviews.review_type == 'Good_review'].Total_Review.value_counts()
Location 940
Nothing Everything 936
Everything 597
Great location 252
Everything 203
...
Outdated hotel rooms a bit shabby arogant receptionists Excellent location 1
Staff unobtrusive but efficient Queries answered in a helpful manner 1
Room Service food was awful but breakfast was good 1
all good location 1
I was in 3rd floor It didn t work Free Wife staff was very kind 1
Name: Total_Review, Length: 411312, dtype: int64
df_reviews[df_reviews.review_type == 'Bad_review'].Total_Review.value_counts()
Everything Nothing 123
Location 105
Nothing 36
location 26
Staff 22
...
The hotel is not four star 1
The staff checking us in was rude and very un polite The breakfast was cold and tasted disgusting 1
Noisy fan so couldn t sleep kettle didn t work cold shower 1
Hotel isn t good marked from the street no window not clear bed clothes dirty mirror good terry 1
The ac was useless It was a hot week in vienna and it only gave more hot air 1
Name: Total_Review, Length: 85115, dtype: int64
Undersample the good reviews so that both review types are equally represented.
good_reviews = df_reviews[df_reviews.review_type == "Good_review"]
bad_reviews = df_reviews[df_reviews.review_type == "Bad_review"]
good_df = good_reviews.sample(n=len(bad_reviews), random_state=42)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_review_resampled = pd.concat([good_df, bad_reviews]).reset_index(drop=True)
df_review_resampled.shape
(172350, 2)
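The undersampling step can be illustrated on a toy frame. This is a sketch only: the eight rows are invented for illustration, and `pd.concat` is used because `DataFrame.append` is not available in pandas 2.x:

```python
import pandas as pd

# Toy data (invented): six good reviews and two bad ones
toy = pd.DataFrame({
    "Total_Review": [f"review {i}" for i in range(8)],
    "review_type": ["Good_review"] * 6 + ["Bad_review"] * 2,
})

good = toy[toy.review_type == "Good_review"]
bad = toy[toy.review_type == "Bad_review"]

# Draw as many good reviews as there are bad ones, reproducibly
good_sampled = good.sample(n=len(bad), random_state=42)
balanced = pd.concat([good_sampled, bad]).reset_index(drop=True)

counts = balanced.review_type.value_counts()
print(counts)
```

Undersampling discards most of the majority class; the trade-off is a smaller but balanced training set.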
df_review_resampled.head()
| | Total_Review | review_type |
|---|---|---|
| 0 | Being really picky here as all was great but ... | Good_review |
| 1 | We were given unbeknown to us a handicap acce... | Good_review |
| 2 | Location a little restrictive Hotel facilities | Good_review |
| 3 | Staff service at the bar was appalling Bed w... | Good_review |
| 4 | No information in rooms about London and some... | Good_review |
df_review_resampled.rename(columns={'Total_Review':'text'}, inplace=True)
sns.countplot(
x='review_type',
data=df_review_resampled,
order=df_review_resampled.review_type.value_counts().index
)
plt.xlabel("type")
plt.title("Review type (resampled)");
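def strip_emoji(text):
    # Remove all emoji characters
    return emoji.replace_emoji(text,replace="")
def strip_all_entities(text):
    # Normalise line breaks and convert to lowercase
    text = text.replace('\r', '').replace('\n', ' ').lower()
    # Remove @mentions and URLs
    text = re.sub(r"(?:\@|https?\://)\S+", "", text)
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7f]',r'', text)
    # Collapse runs of a repeated character (note: this also shortens legitimate double letters)
    text = re.sub(r'(.)\1+', r'\1', text)
    # Remove digits
    text = re.sub('[0-9]+', '', text)
    # Remove punctuation
    stopchars= string.punctuation
    table = str.maketrans('', '', stopchars)
    text = text.translate(table)
    # Normalise whitespace (joining the string itself would put a space between every character)
    text = ' '.join(text.split())
    return text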
def decontract(text):
    # Expand common English contractions; "can't" must be handled before the generic "n't" rule
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    return text
def clean_hashtags(tweet):
    # Raw strings are required here: in a normal string literal '\b' is a backspace character, not a word boundary
    new_tweet = " ".join(word.strip() for word in re.split(r'#(?!(?:hashtag)\b)[\w-]+(?=(?:\s+#[\w-]+)*\s*$)', tweet))
    new_tweet2 = " ".join(word.strip() for word in re.split(r'#|_', new_tweet))
    return new_tweet2
def filter_chars(a):
    # Drop words that contain '$' or '&'
    sent = []
    for word in a.split(' '):
        if ('$' in word) or ('&' in word):
            sent.append('')
        else:
            sent.append(word)
    return ' '.join(sent)
def remove_mult_spaces(text):
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s\s+"," ",text)
def stemmer(text):
    # Reduce each token to its Porter stem
    tokenized = nltk.word_tokenize(text)
    ps = PorterStemmer()
    return ' '.join([ps.stem(words) for words in tokenized])
def lemmatize(text):
    # Map each token to its WordNet lemma
    tokenized = nltk.word_tokenize(text)
    lm = WordNetLemmatizer()
    return ' '.join([lm.lemmatize(words) for words in tokenized])
def preprocess(text):
    # Apply the cleaning steps in order; strip_all_entities is left disabled
    text = strip_emoji(text)
    text = decontract(text)
    # text = strip_all_entities(text)
    text = clean_hashtags(text)
    text = filter_chars(text)
    text = remove_mult_spaces(text)
    text = stemmer(text)
    text = lemmatize(text)
    return text
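The reviews in this dataset appear with punctuation already stripped (for example "It didn t work"), so apostrophe-based patterns such as those in decontract would not match them. Below is a sketch of a variant that targets the space-separated form instead. The function name `decontract_spaced` and its two rules are assumptions for illustration, not part of the original pipeline:

```python
import re

def decontract_spaced(text):
    """Expand negative contractions whose apostrophe was replaced by a space (hypothetical helper)."""
    # "can t" must be handled first, otherwise the generic rule would yield "ca not"
    text = re.sub(r"\bcan t\b", "can not", text)
    # "didn t" -> "did not", "wasn t" -> "was not", "couldn t" -> "could not"
    text = re.sub(r"(\w)n t\b", r"\1 not", text)
    return text

print(decontract_spaced("It didn t work"))
print(decontract_spaced("Noisy fan so couldn t sleep"))
```

The trailing word boundary keeps phrases such as "in the room" untouched, since the "t" there is followed by another letter.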
import nltk
nltk.download('omw-1.4')
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Vidit\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
True
df_review_resampled.head()
| | text | review_type |
|---|---|---|
| 0 | Being really picky here as all was great but ... | Good_review |
| 1 | We were given unbeknown to us a handicap acce... | Good_review |
| 2 | Location a little restrictive Hotel facilities | Good_review |
| 3 | Staff service at the bar was appalling Bed w... | Good_review |
| 4 | No information in rooms about London and some... | Good_review |
df_review_resampled['review_type'].nunique()
2
review_type=['Good_review','Bad_review']
df_review_resampled['cleaned_text']=df_review_resampled['text'].apply(preprocess)
df_review_resampled.head()
| | text | review_type | cleaned_text |
|---|---|---|---|
| 0 | Being really picky here as all was great but ... | Good_review | be realli picki here a all wa great but a coup... |
| 1 | We were given unbeknown to us a handicap acce... | Good_review | we were given unbeknown to u a handicap access... |
| 2 | Location a little restrictive Hotel facilities | Good_review | locat a littl restrict hotel facil |
| 3 | Staff service at the bar was appalling Bed w... | Good_review | staff servic at the bar wa appal bed wa veri c... |
| 4 | No information in rooms about London and some... | Good_review | no inform in room about london and some staff ... |
df_review_resampled["cleaned_text"].duplicated().sum()
4604
df_review_resampled.drop_duplicates("cleaned_text", inplace=True)
df_review_resampled['review_list'] = df_review_resampled['cleaned_text'].apply(word_tokenize)
df_review_resampled.head()
| | text | review_type | cleaned_text | review_list |
|---|---|---|---|---|
| 0 | Being really picky here as all was great but ... | Good_review | be realli picki here a all wa great but a coup... | [be, realli, picki, here, a, all, wa, great, b... |
| 1 | We were given unbeknown to us a handicap acce... | Good_review | we were given unbeknown to u a handicap access... | [we, were, given, unbeknown, to, u, a, handica... |
| 2 | Location a little restrictive Hotel facilities | Good_review | locat a littl restrict hotel facil | [locat, a, littl, restrict, hotel, facil] |
| 3 | Staff service at the bar was appalling Bed w... | Good_review | staff servic at the bar wa appal bed wa veri c... | [staff, servic, at, the, bar, wa, appal, bed, ... |
| 4 | No information in rooms about London and some... | Good_review | no inform in room about london and some staff ... | [no, inform, in, room, about, london, and, som... |
# Number of tokens in each review
df_review_resampled['text_len'] = df_review_resampled['review_list'].apply(len)
df_review_resampled=df_review_resampled[df_review_resampled['text_len']!=0]
df_review_resampled.shape
(167745, 5)
from sklearn.preprocessing import LabelEncoder
encoder =LabelEncoder()
encoded_review = encoder.fit_transform(df_review_resampled.review_type.values)
encoded_review[1:5]
array([1, 1, 1, 1])
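LabelEncoder assigns integer codes in sorted order of the class names, so it is worth checking which code corresponds to which label before naming the rows of a confusion matrix or classification report. A small sketch on three made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

# Three illustrative labels, in the order they might appear in the data
sample_labels = ["Good_review", "Bad_review", "Good_review"]

enc = LabelEncoder()
codes = enc.fit_transform(sample_labels)

# classes_ lists the label for code 0, code 1, ... in that order
label_names = list(enc.classes_)
print(label_names, list(codes))
```

Here 'Bad_review' sorts before 'Good_review', so bad reviews receive code 0 and good reviews code 1, which matches the array of ones shown above for rows labelled Good_review.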
from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(
df_review_resampled.text,
encoded_review,
test_size=0.25,
random_state=42
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((125808,), (41937,), (125808,), (41937,))
tf_idf = TfidfVectorizer()
X_train_tf = tf_idf.fit_transform(X_train)
X_test_tf = tf_idf.transform(X_test)
print(X_train_tf.shape)
print(X_test_tf.shape)
(125808, 43624) (41937, 43624)
lr=LogisticRegression()
lr_cv_score=cross_val_score(lr,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_lr_cv = np.mean(lr_cv_score)
mean_lr_cv
0.8260478114536053
lin_svc = LinearSVC()
lin_svc_cv_score = cross_val_score(lin_svc,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_lin_svc_cv = np.mean(lin_svc_cv_score)
mean_lin_svc_cv
0.8181824314320018
multiNB = MultinomialNB()
multiNB_cv_score = cross_val_score(multiNB,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_multiNB_cv = np.mean(multiNB_cv_score)
mean_multiNB_cv
0.7969156929095422
dtree = DecisionTreeClassifier()
dtree_cv_score = cross_val_score(dtree,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_dtree_cv = np.mean(dtree_cv_score)
mean_dtree_cv
0.709411235999091
rand_forest = RandomForestClassifier()
rand_forest_cv_score = cross_val_score(rand_forest,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_rand_forest_cv = np.mean(rand_forest_cv_score)
mean_rand_forest_cv
0.7973628754332519
adab=AdaBoostClassifier()
adab_cv_score = cross_val_score(adab,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_adab_cv = np.mean(adab_cv_score)
mean_adab_cv
0.7739219811500005
Comparing the models by 5-fold cross-validated macro-F1, logistic regression (0.826) and the linear SVM (0.818) performed similarly and ahead of the other models. We proceed with the linear SVM, which we consider lightweight and likely to generalise well.
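The per-model cells above repeat the same three steps, which could also be expressed as one loop. The sketch below runs on a tiny invented corpus and wraps the vectorizer and classifier in a pipeline so that TF-IDF is refit inside each fold. That pipeline is a design choice of this sketch and differs from the notebook, which vectorizes once before cross-validation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: 1 = good review, 0 = bad review
texts = [
    "great location and friendly staff", "lovely room and great breakfast",
    "friendly staff and clean room", "great breakfast and lovely view",
    "dirty room and rude staff", "noisy room and cold shower",
    "rude staff and bad breakfast", "cold shower and dirty bathroom",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "logistic_regression": LogisticRegression(),
    "linear_svc": LinearSVC(),
    "multinomial_nb": MultinomialNB(),
}

scores = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    # Mean macro-F1 over 2 stratified folds (2 only because the toy corpus is tiny)
    scores[name] = cross_val_score(pipe, texts, labels, cv=2, scoring="f1_macro").mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The scores on such a small corpus are not meaningful; the point is only the structure of the comparison.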
svc1 = LinearSVC()
param_grid = {'C':[0.0001,0.001,0.01,0.1,1,10],
'loss':['hinge','squared_hinge'],
'fit_intercept':[True,False]}
grid_search = GridSearchCV(svc1,param_grid,cv=5,scoring='f1_macro',n_jobs=-1,verbose=0,return_train_score=True)
grid_search.fit(X_train_tf,y_train)
GridSearchCV(cv=5, estimator=LinearSVC(), n_jobs=-1,
param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
'fit_intercept': [True, False],
'loss': ['hinge', 'squared_hinge']},
return_train_score=True, scoring='f1_macro')
grid_search.best_estimator_
LinearSVC(C=0.1)
grid_search.best_score_
0.8259393847278773
# Note: this fits the default LinearSVC (C=1.0) defined earlier, not grid_search.best_estimator_
lin_svc.fit(X_train_tf,y_train)
y_pred = lin_svc.predict(X_test_tf)
def print_confusion_matrix(confusion_matrix, class_names, figsize = (10,7), fontsize=14):
    # Plot a labelled heatmap of the confusion matrix
    df_cm = pd.DataFrame(confusion_matrix, index=class_names, columns=class_names)
    fig = plt.figure(figsize=figsize)
    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    plt.ylabel('Truth')
    plt.xlabel('Prediction')
cm = confusion_matrix(y_test,y_pred)
# LabelEncoder sorts classes alphabetically (0 = Bad_review, 1 = Good_review), so take the names from the encoder
print_confusion_matrix(cm,encoder.classes_)
print('Classification Report:\n',classification_report(y_test, y_pred, target_names=encoder.classes_))
Classification Report:
               precision    recall  f1-score   support

  Bad_review       0.81      0.83      0.82     21213
 Good_review       0.83      0.80      0.81     20724

    accuracy                           0.82     41937
   macro avg       0.82      0.82      0.82     41937
weighted avg       0.82      0.82      0.82     41937
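# Save the fitted TF-IDF vectorizer and the trained classifier (file names kept from the original)
with open('hotel_reviews.pkl', 'wb') as f:
    pickle.dump(tf_idf, f)
with open('hotelreviews.pkl', 'wb') as f:
    pickle.dump(lin_svc, f)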
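To reuse the saved artifacts, both the vectorizer and the classifier have to be loaded and applied in the same order as during training. The sketch below shows the round trip on an invented four-review corpus, serialising to in-memory bytes instead of the files above so that it runs on its own:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented toy training data: 1 = good review, 0 = bad review
train_texts = [
    "great location and friendly staff", "lovely clean room",
    "dirty room and rude staff", "noisy and cold room",
]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LinearSVC()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# Serialise and restore both objects (pickle.dump/load on files works the same way)
loaded_vectorizer = pickle.loads(pickle.dumps(vectorizer))
loaded_model = pickle.loads(pickle.dumps(model))

new_reviews = ["the staff was friendly", "the room was dirty"]
original_pred = model.predict(vectorizer.transform(new_reviews))
restored_pred = loaded_model.predict(loaded_vectorizer.transform(new_reviews))
print(list(restored_pred))
```

New text must go through the restored vectorizer's `transform`, never `fit_transform`, so that it is mapped onto the vocabulary learned during training.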
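Sentiment analysis was performed with several ML algorithms: Logistic Regression, linear SVM, Multinomial Naive Bayes, Decision Tree, Random Forest, and AdaBoost.
Logistic regression and the linear SVM achieved the best 5-fold cross-validated macro-F1 scores, 0.826 and 0.818 respectively; the linear SVM reached 82% accuracy on the held-out test set.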